20 research outputs found

    A GPU-Computing Approach to Solar Stokes Profile Inversion

    Full text link
    We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS (GENEtic Stokes Inversion Strategy), employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units GPUs, along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disc maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel genetic algorithm with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disc vector magnetograms derived by this method are shown, using SOLIS/VSM data observed on 2008 March 28 at 15:45 UT

    Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors

    Full text link
    [EN] Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaption of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM (R) big.LITTLE (TM) processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach from the perspectives of both throughput and energy efficiency. (C) 2016 Elsevier B.V. All rights reserved.The researchers from Universidad Jaume I were supported by projects CICYT TIN2011-23283 and TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project CICYT TIN2015-65277-R. The researcher from Universitat Politecnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Dep. dinnovacio, Universitats i Empresa.Catalán, S.; Herrero, JR.; Igual Peña, FD.; Rodríguez-Sánchez, R.; Quintana Ortí, ES.; Adeniyi-Jones, C. (2018). Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors. Journal of Computational Science. 25:140-151. https://doi.org/10.1016/j.jocs.2016.10.020S1401512

    Fine-grained bit-flip protection for relaxation methods

    Full text link
    [EN] Resilience is considered a challenging under-addressed issue that the high performance computing community (HPC) will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejecting some component updates, and turning the initial synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Orti was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.Anzt, H.; Dongarra, J.; Quintana Ortí, ES. (2019). Fine-grained bit-flip protection for relaxation methods. Journal of Computational Science. 36:1-11. https://doi.org/10.1016/j.jocs.2016.11.013S11136Chow, E., & Patel, A. (2015). Fine-Grained Parallel Incomplete LU Factorization. SIAM Journal on Scientific Computing, 37(2), C169-C193. doi:10.1137/140968896Karpuzcu, U. R., Kim, N. S., & Torrellas, J. (2013). Coping with Parametric Variation at Near-Threshold Voltages. IEEE Micro, 33(4), 6-14. doi:10.1109/mm.2013.71Bronevetsky, G., & de Supinski, B. (2008). Soft error vulnerability of iterative linear algebra methods. Proceedings of the 22nd annual international conference on Supercomputing - ICS ’08. doi:10.1145/1375527.1375552Sao, P., & Vuduc, R. (2013). Self-stabilizing iterative solvers. Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA ’13. doi:10.1145/2530268.2530272Calhoun, J., Snir, M., Olson, L., & Garzaran, M. (2015). Understanding the Propagation of Error Due to a Silent Data Corruption in a Sparse Matrix Vector Multiply. 2015 IEEE International Conference on Cluster Computing. doi:10.1109/cluster.2015.101Chazan, D., & Miranker, W. (1969). Chaotic relaxation. Linear Algebra and its Applications, 2(2), 199-222. doi:10.1016/0024-3795(69)90028-7Frommer, A., & Szyld, D. B. (2000). On asynchronous iterations. Journal of Computational and Applied Mathematics, 123(1-2), 201-216. doi:10.1016/s0377-0427(00)00409-xDuff, I. S., & Meurant, G. A. (1989). The effect of ordering on preconditioned conjugate gradients. BIT, 29(4), 635-657. doi:10.1007/bf01932738Aliaga, J. I., Barreda, M., Dolz, M. F., Martín, A. F., Mayo, R., & Quintana-Ortí, E. S. (2014). Assessing the impact of the CPU power-saving modes on the task-parallel solution of sparse linear systems. Cluster Computing, 17(4), 1335-1348. doi:10.1007/s10586-014-0402-

    Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems

    Get PDF
    General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51Ă— and 4.20Ă— (143Ă— and 67Ă—) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10Ă— speedups over two state-of-the-art automatic GPU code generators

    Mastering system and power measures for servers in datacenter

    Get PDF
    International audienceUsing power meters and performance counters to get insight on system's behavior in terms of power consumption is common nowadays. The values coming from these external or internal meters are usually used directly by the research community, for instance to derive higher-level power models with learning techniques or to use them in decision tools such as schedulers in HPC and Cloud Computing. While it is reasonable when one wants only to have a broad view on the power consumption, they can not be used directly in most cases: We prove in this article that the problems of distributed measure and hardware limits are way more complex and create bias, and we give the keys to understand and chose the proper methodology to handle these bias to obtain relevant values for enhanced usage. A generic methodology is analyzed and its main lessons extracted for a direct usage by the research community to master system and power measures for servers in datacenter
    corecore